Metamodels for Data Quality Description
Abstract
Data quality descriptions are crucial, but methods to produce and use them have not improved significantly during the past 10 years. Current quality descriptions take the perspective of the data producer, not the user. They are mostly verbal and not suitable for rapid comparison with a required standard when deciding on the ‘fitness for use’ of a dataset for a task. This limits business with geographic data over the net. The paper introduces the concept of a metamodel as a framework to compare data quality from a producer and a user perspective within a single model. It is based on category theory and morphisms, which link the model of reality with the model of the GIS data, their collection, and their use. The achieved quality of a decision based on the data can be derived. It is shown that data quality descriptions depend on the intended use of the data; a ‘use independent’, generic data quality description is not possible. Fortunately, a large set of GIS functions demands the same data quality description, so not every potential use requires a different data quality description of a dataset.

1 Current Data Quality Descriptions Are Inadequate

Data quality descriptions are crucial for a flexible use of GIS. They are the key to the development of a sizable commerce in GIS data. Data quality descriptions are necessary for differentiated marketing strategies, in particular product differentiation (Frank 1995; Frank 1996). They are also the key to limiting the liability of data producers. Current practice of data quality description is inadequate and does not help users decide whether a potentially useful dataset should be acquired and used.

Overall, data quality descriptions have not improved much in the past 10 years. The publication of the data quality description in the Spatial Data Transfer Standard (Morrison 1988) and the report of the NCGIA Specialist Meeting in Santa Barbara 1988 (Goodchild and Gopal 1989) document the research frontier then. The list of parameters to describe geographic data has not changed in the past 15 years; there is nearly no difference between the parameters listed in (Robinson and Frank 1985) and the lists published today (Stanek and Frank 1993). There has been no progress in defining the quality of GIS data quantitatively beyond positional accuracy for well-defined points. Data exchange standards and the practice of data producers rely heavily on lineage description as a replacement for an effective, objective description (Chrisman 1991). There is no sizable discussion of data quality transfer functions, which link the data quality of inputs to the quality of the results (for a case study see (Zeitlberger 1997)).

This paper first reviews the overall situation of data sharing between organisations, where data quality descriptions are necessary. It then reports on a case study assessing the currently available data quality descriptions for a number of data collections and shows that these descriptions are formulated from a data producer perspective and are not suitable to answer the potential data user’s questions. In the second part, a formal model of the relations between reality, geographic data in a GIS, and its use is set up. This metamodel allows the situation to be described using morphisms. The model contains
• an observation function linking reality to the data collected, and
• the decision process which links the data to a decision; this is termed the ‘function of interest’.
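To make this structure concrete, here is a minimal sketch, not from the paper, in which reality is reduced to a single parcel area: the observation function produces the stored datum, the function of interest produces a decision, and an error function compares the decision made from the data with the decision an ideal observer would make from reality. The domain, names, and noise model are all assumptions for illustration.

```python
import random

def observation(true_area_m2, rms_m2=25.0):
    """Observation function: reality -> data (a noisy area measurement)."""
    return true_area_m2 + random.gauss(0.0, rms_m2)

def function_of_interest(area_m2):
    """Function of interest: data -> decision ('is the parcel buildable?')."""
    return area_m2 >= 500.0

def decision_error(true_area_m2):
    """Error function: does the decision based on observed data deviate
    from the decision that would be based on reality itself?"""
    from_reality = function_of_interest(true_area_m2)
    from_data = function_of_interest(observation(true_area_m2))
    return from_reality != from_data

# Near the decision threshold, observation noise changes decisions often;
# far from it, the same data quality is harmless. The quality requirement
# thus depends on the intended use, which is the paper's central point.
print(sum(decision_error(505.0) for _ in range(10_000)) / 10_000)  # noticeable
print(sum(decision_error(900.0) for _ in range(10_000)) / 10_000)  # ~0
```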
A user’s query is a typical example of a function of interest: it extracts some data from a database, and the result is used in a decision process. For a data quality description to be usable, the user must be able to link the data quality statement given for the data to the quality of the results deduced from the database. This presupposes a ‘data quality transfer’ function related to the function of interest; in general, each function of interest has its own data quality transfer function. For each function of interest, a corresponding error function can be established, and the relations between the observation function, the function of interest, and the error functions involved can be described. These relations can be formalized, and a simple example is given in this paper.

This paper deals with the data quality description, which forms the input into the data quality transfer function of a user, and demonstrates that there is a linkage between the data quality description and the ‘function of interest’ (respectively, the data quality transfer function). A single data quality description is not sufficient for all functions of interest in a GIS. It follows that data quality descriptions must be made to suit the intended use of the data. A fully general data quality description, useful to decide on fitness for any potential use, is not possible. But neither is a specific data quality description necessary for every possible use: large classes of operations on spatial data can use the same data quality descriptions. The determination of these classes is an important open research question.

2 Why Data Quality Descriptions?

2.1 Underlying assumption: data sharing

Data quality descriptions are only necessary if data is collected by one organisation and used by another (Frank 1992). If, as was customary in the past, the same organisation collects and uses the data, a description of the data quality is not necessary: the user can directly influence the collection process and adjust it so the data are suitable for his needs, and minimal cost accrues for collecting and processing. If a direct feedback loop between data user and data collector within an agency is missing, then an explicit consideration of data quality is necessary to minimize data collection cost while fulfilling the data user’s requirements.

When one organisation collects data and another one uses them, the producer must describe the quality of the data collected and the user must prescribe the quality required for the task. It follows that the data are fit for the intended use if the quality of the data as produced is better than the required quality. Unfortunately, today a decision about fitness for use cannot be made without the user having a very significant understanding of the processes used for the data collection; typically a discussion between data producers and users is necessary to reach a conclusion about fitness for use.

2.2 Regular GIS assumption: sharing reduces cost

The discussion pointing to the need for data quality descriptions is situated in the regular GIS assumption: data is expensive to collect and maintain, but data once collected can be used for many purposes. Cost reduction by avoiding duplication in the initial data collection is important, but the later cost reduction from avoiding duplicated maintenance of the data is usually much larger. Data sharing is a crucial concept for GIS (National Research Council 1980).
The argument is more complex than the simple argument for a reduction in the cost of administration as it was initially put (Clapp, Moyer, and Niemann 1988; Gurda et al. 1987). If each agency uses its own dataset, collected and maintained individually, these datasets will necessarily differ within the error margins set for their collection processes. This results in problems:
• in borderline cases, two decisions based on these data collections may contradict each other; citizens affected by these decisions become aware of the errors, lose faith in the administrative decision process, etc.
• other data related to the spatial datasets cannot be integrated quickly, because the spatial reference objects are not the same.

2.3 Data quality description crucial for geographic data business

Data quality descriptions should assure the user that the data are fit for the intended use (‘fitness for use’ (Chrisman 1983)). If the quality described is better than required, the data can be acquired and used for the intended task. Data quality descriptions are necessary to facilitate the emerging commerce in geographic data: the data quality description is part of the metadata, which in turn is part of the information a data producer makes available to prospective users of his data. A clear description of data quality brings:
• More use of data: The decision about fitness for use by a possible user (buyer) of data can be made quickly and objectively. As more data is made available over the Internet, users (or programmed agents acting on behalf of a user) must be able to decide automatically which dataset can be used for a particular task (Voisard and Schweppe 1996).
• Limits to liability: If a dataset is labelled by the producer with an objectively measured data quality, users who apply the data to purposes which cannot be achieved with this quality are clearly warned. This effectively limits the producer’s liability for damage resulting from errors in the data larger than acceptable within the stated quality. Data quality descriptions are necessary to assign liability: who is liable for damage occurring as a consequence of using the data? Did the producer deliver data of lower quality than asserted and thus cause the damage (possibly incurring a legal liability), or were the data according to the quality asserted, but errors in processing caused the damage (for which the user would be responsible)?

3 Critique of Current Description Methods

Data quality descriptions should be
• independent of the production method,
• operational, and
• quantitative.
By the first we mean that the description of the data quality must not refer to the production method, but use a neutral formulation independent of how the data were produced. Second, a data quality description is operational if it can be used in a formalised (automated) process and does not depend on human interpretation of the terms. Third, the data quality must be measured on an (at least) ordinal scale, which allows the comparison of the quality of two datasets.

3.1 Case study: metadata according to standards

A class of surveying engineering students collected metadata on commercially available datasets in Austria, described the metadata according to the CEN (Comité Européen de Normalisation) metadata standard, and collated the metadata in a commercial database (Timpf, Raubal, and Kuhn 1996). The goal of the study was twofold. First, the students gave feedback to the responsible working group at CEN about the usage of the metadata standard.
Second, they assessed whether potential users of the metadatabase can understand the information provided and whether it is sufficient for decision-making by professional users. The usability of the metadata was low: they found that the metadata described the data from the data producer’s point of view and did not help the user to make a decision about the suitability of a dataset for an intended task (Timpf and Frank 1997). Users need information on a higher level of abstraction; the information given was often too detailed and too confusingly presented. Most importantly, users need to know what operations are supported by the data.

3.2 Data quality descriptions are producer oriented

Data quality descriptions are most often provided as ‘lineage’, which describes the process that was used for the collection and processing of the data. This implies a very detailed and accurate description of the quality of the data. The description is objective, as the same process can be duplicated and should result in a data collection with similar quality characteristics. The description of data quality as lineage is easy for the producer: it is knowledge which is readily available; one just describes what one has done. But it shifts the burden of interpreting the data quality description and deciding about fitness for use to the potential user of the data, who must make the connection between a production method unknown to him and his intended use.

3.3 Data quality descriptions are not operational

A description can be called operational if there are standardized procedures which can be used to determine the data quality values without requiring interpretation. Operational methods are described in various standards to measure hard-to-determine values such as ‘noise production of a car’, ‘intellectual ability’, etc. In every case, a well-defined set of observation methods in a completely determined environment is used to assess the property, e.g., the procedures used to determine the SAT (Scholastic Aptitude Test) scores for entering students. The results of operational procedures are comparable, even if the individuals measured are incomparable. Data quality descriptions using lineage are not operational: two datasets can have very similar characteristics but result from different data collection efforts (e.g., photogrammetry or field survey), and their lineage descriptions are very different. The comparison of lineage descriptions is difficult and requires intimate knowledge of the different data collection methods, the applied techniques, the instruments used, etc. Given a dataset of unknown origin, a lineage description cannot be produced. This implies that a given lineage description cannot be tested by the user independently of the data producer; one can only check the production records to see if the stated method was followed correctly.

3.4 Data quality descriptions are not quantitative

Data quality descriptions should be quantitative measures of the quality of the data. This is necessary so that a user can decide if the quality provided is better than the minimal quality required for a particular use. For a decision about fitness for use, a quality description on an ordered (ordinal) scale is sufficient. To allow the propagation of data quality through the potential user’s data analysis, data quality measures should be on a ratio or absolute scale (Stevens 1946).
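To illustrate what an ordinal, quantitative description would enable, here is a small sketch of the fitness-for-use decision as an automatic comparison of provided against required quality. The attribute names, their ‘better than’ directions, and the numbers are illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass

@dataclass
class QualityDescription:
    positional_rms_m: float   # lower is better
    completeness_pct: float   # higher is better
    currency_years: float     # lower is better

def fit_for_use(provided: QualityDescription,
                required: QualityDescription) -> bool:
    """The dataset is fit for use if every provided value is at least
    as good as the corresponding required value."""
    return (provided.positional_rms_m <= required.positional_rms_m
            and provided.completeness_pct >= required.completeness_pct
            and provided.currency_years <= required.currency_years)

# Example: a dataset offered with 0.15 m RMS, 99% completeness, and a
# 1-year update cycle, for a task requiring 0.5 m, 95%, and 2 years.
offered = QualityDescription(0.15, 99.0, 1.0)
needed = QualityDescription(0.5, 95.0, 2.0)
assert fit_for_use(offered, needed)
```

A lineage description cannot be compared this way; the comparison above is exactly what an agent acting for a user would need to perform automatically.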
Operational, quantitative measures for data quality are only available for the positional accuracy of sharply defined points (RMS error); statistical methods to determine a sample etc. are well known (Cressie 1991). These quality descriptions can be used to predict the quality of derived values by applying the law of error propagation.

4 Data Quality Description as ‘Product Specification’

Data quality descriptions are like other product specifications: they describe the properties of the goods to be exchanged (in this case data) such that the user can decide if the goods are ‘good enough’ for the intended use and the producer can point to the asserted properties. The limits are important in case a problem occurs where the user claims that the good caused damage or was faulty and had to be replaced. Product specifications are typically written as limits of the intended use, or they describe simple properties of the good which are of importance to the user: operating ranges for temperature or humidity, weight of the product, speed of processing, etc. The specifications are written
• in terms relevant for the use of the product; rarely do they describe the methods of manufacturing the product, and
• as quantitative values measured with an accepted method, such that a decision can be made whether the product is within these limits or not.
Neither of these points is fulfilled by current data quality descriptions.

5 A Metamodel as a Framework

Data quality seems to be an elusive concept. The clear idea of a ‘fitness for use’ decision is hard to operationalize as a comparison of two figures on an ordinal scale. The difficulty with the formalization of data quality is due to the definition of data quality as ‘correspondence to reality’: data of high quality correspond well to reality; for data of low quality, the deviation from reality is larger. The approach here is centered around functions and the composition of functions. Category theory (Asperti and Longo 1991; Walters 1991) is ‘algebra with functions’: the morphisms (arrows) are functions, and the only operation is composition. h = f . g denotes the function h which results from applying g to a set of data and applying f to the result; it can be written as h(x) = f(g(x)). Category theory often uses diagrams, which show functions as arrows leading from a domain to a co-domain. Diagrams are said to commute if f . g = h . j (see Figure 1).
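The following sketch, with assumed names and numbers, connects this composition view to the law of error propagation mentioned above: the distance between two stored points is a function of interest h = f . g, and its quality transfer function derives the RMS error of the result from the positional RMS errors of the inputs.

```python
import math

def compose(f, g):
    """h = f . g, i.e. h(x) = f(g(x))."""
    return lambda x: f(g(x))

def lookup(db):
    """g: extract two points from a (toy) database."""
    return db["p1"], db["p2"]

def distance(points):
    """f: derive a distance from the two points."""
    (x1, y1), (x2, y2) = points
    return math.hypot(x2 - x1, y2 - y1)

# The function of interest, written as a composition.
query_distance = compose(distance, lookup)

def distance_rms(rms_p1, rms_p2):
    """Quality transfer function for this function of interest:
    assuming independent, per-coordinate RMS errors of the two points,
    first-order error propagation gives the RMS error of the distance
    as the root sum of squares of the point errors."""
    return math.sqrt(rms_p1 ** 2 + rms_p2 ** 2)

db = {"p1": (0.0, 0.0), "p2": (100.0, 0.0)}
d = query_distance(db)            # 100.0 m
sigma = distance_rms(0.05, 0.05)  # ~0.07 m, to compare with the required accuracy
```

A different function of interest (an area, a buffer, a visibility query) would need a different transfer function, which is precisely why a single quality description cannot serve all uses.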